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Abstract 

We present a framework for the design of optimal assembly algorithms for shotgun sequencing 
under the criterion of complete reconstruction. We derive a lower bound on the read length and the 
coverage depth required for reconstruction in terms of the repeat statistics of the genome. Building 
on earlier works, we design a de Brujin graph based assembly algorithm which can achieve very 
close to the lower bound for repeat statistics of a wide range of sequenced genomes, including 
the GAGE datasets. The results are based on a set of necessary and sufficient conditions on the 
DNA sequence and the reads for reconstruction. The conditions can be viewed as the shotgun 
sequencing analogue of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by 
Hybridization. 
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1 Introduction 

1.1 Problem statement 

DNA sequencing is the basic workhorse of mod- 
ern day biology and medicine. Since the se- 
quencing of the Human Reference Genome ten 
years ago, there has been an explosive advance 
in sequencing technology, resulting in several or- 
ders of magnitude increase in throughput and 
decrease in cost. Multiple "next-generation" se- 
quencing platforms have emerged. All of them 
are based on the whole-genome shotgun sequenc- 
ing method, which entails two steps. First, many 
short reads are extracted from random locations 
on the DNA sequence, with the length, num- 
ber, and error rates of the reads depending on 
the particular sequencing platform. Second, the 
reads are assembled to reconstruct the original 
DNA sequence. 

Assembly of the reads is a major algorithmic 
challenge, and over the years dozens of assem- 
bly algorithms have been proposed to solve this 
problem j28j . Nevertheless, the assembly prob- 
lem is far from solved, and it is not clear how 
to compare algorithms nor where improvement 
might be possible. The difficulty of comparing 
algorithms is evidenced by the recent assembly 
evaluations Assemblathon 1 [2] and GAGE [23], 
where which assembler is "best" depends on the 
particular dataset as well as the performance 
metric used. In part this is a consequence of 
metrics for partial assemblies: there is an inher- 
ent tradeoff between larger contiguous fragments 
(contigs) and fewer mistakes in merging contigs 
(misjoins). But more fundamentally, indepen- 
dent of the metric, performance depends criti- 
cally on the dataset, i.e. length, number, and 
quality of the reads, as well as the complexity 
of the genome sequence. With an eye towards 
the near future, we seek to understand the inter- 
play between these factors by using the intuitive 
and unambiguous metric of complete reconstruc- 
tior[^ Note that this objective of reconstruct- 

^The notion of complete reconstruction can be thought 
of as a mathematical idealization of the notion of "fin- 
ishing" a sequencing project as defined by the National 
Human Genome Research Institute |18) . where finishing 
a chromosome requires at least 95% of the chromosome 



ing the original DNA sequence from the reads 
contrasts with the many optimization-based for- 
mulations of assembly, such as shortest common 
superstring (SCS) [T, maximum-likelihood [161, 
IXlq, and various graph-based formulations [22j, 
jl4j . When solving one of these alternative for- 
mulations, there is no guarantee that the optimal 
solution is indeed the original sequence. 

Given the goal of complete reconstruction, the 
most basic questions are 1) feasibility: given a 
set of reads, is it possible to reconstruct the orig- 
inal sequence? 2) optimality: which algorithms 
can successfully reconstruct whenever it is fea- 
sible to reconstruct? The feasibility question is 
a measure of the intrinsic information each read 
provides about the DNA sequence, and for given 
sequence statistics depends on characteristics of 
the sequencing technology such as read length 
and noise statistics. As such, it can provide an 
algorithm-independent basis for evaluating the 
efficiency of a sequencing technology. Equally 
important, algorithms can be evaluated on their 
relative read length and data requirements, and 
compared against the fundamental limit. 

In studying these questions, we consider the 
most basic shotgun sequencing model where A'' 
noiseless read^ of a fixed length L base pairs 
are uniformly and independently drawn from a 
DNA sequence of length G. In this statistical 
model, feasibility is rephrased as the question of 
whether, for given sequence statistics, the cor- 
rect sequence can be reconstructed with proba- 
bility 1 — e when reads of length L are sampled 
from the genome. We note that answering the 
feasibility question of whether each N, L pair is 
sufficient to reconstruct is equivalent to finding 
the minimum required N (or the coverage depth 
c = NL/G) as a function of L. 

A lower bound on the minimum coverage 
depth needed was obtained by Lander and Wa- 
terman [9^. Their lower bound clw = CYSN^L^e) 
is the minimum number of randomly located 
reads needed to cover the entire DNA sequence 
with a given target success probability 1 — e. 
While this is clearly a necessary condition, it is 



to be represented by a contiguous sequence. 

^Reads are thus exact subsequences of the DNA. 
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in general not tight: only requiring the reads 
to cover the entire genome sequence does not 
guarantee that consecutive reads can actually be 
stitched back together to recover the original se- 
quence. Characterizing when the reads can be 
reliably stitched together, i.e. determining fea- 
sibility, is an open problem. In fact, the ability 
to reconstruct depends crucially on the repeat 
statistics of the DNA sequence. 

An earlier work [T3] has answered the feasi- 
bility and optimality questions under an i.i.d. 
model for the DNA sequence. However, real 
DNA, especially those of eukaryotes, have much 
longer and complex repeat structures. Here, we 
are interested in determining feasibility and opti- 
mality given arbitrary repeat statistics. This al- 
lows us to evaluate algorithms on statistics from 
already sequenced genomes, and gives confidence 
in predicting whether the algorithms will be use- 
ful for an unseen genome with similar statistics. 

1.2 Results 

Our approach results in a pipeline, which takes 
as input a genome sequence and desired success 
probability 1 — e, computes a few simple repeat 
statistics, and from these statistics computes a 
feasibility plot that indicates for which L,N re- 
construction is possible. Fig. [T] displays the sim- 
plest of the statistics, the number of repeats 
function of the repeat length £. Fig. [2] shows 
the resulting feasibility plot produced for the 
statistics of human chromosome 19 (henceforth 
hcl9) with success probability 99%. The hori- 
zontal axis signifies read length L and the verti- 
cal axis signifies the normalized coverage depth 
c := cjcYSN^ the coverage depth c normalized by 
clW) the coverage depth required as per Lander- 
Waterman [S] in order to cover the sequence. 

Since the coverage depth must satisfy c > clwj 
the normalized coverage depth satisfies c > 1, 
and we plot the horizontal line c = 1. This 
lower bound holds for any assembly algorithm. 
In addition, there is another lower bound, shown 
as the thick black nearly vertical line in Fig. [2] 
In contrast to the coverage lower bound, this 
lower bound is a function of the repeat statis- 
tics. It has a vertical asymptote at Lcrit := 



im«~^206o' ~ 3000 4660 

Figure 1: For hcl9, a log plot of number of repeats 
as a function of the repeat length I. Red line is what 
would have been predicted by an i.i.d. fit. 
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Figure 2: Thick black lines are lower bounds on 
feasibility which holds for all algorithms, and col- 
ored curves are performance achieved by specific al- 
gorithms. Four such curves are shown: the greedy al- 
gorithm and three de Brujin graph based algorithms. 

max {^intgrieaved) ^triple} ~l~ Ij where interleaved is 

the length of the longest interleaved repeats in 
the DNA sequence and triple is the length of the 
longest triple repeat (see Section [2] for precise 
definitions). Our lower bound can be viewed as 
a generalization of a result of Ukkonen for 
Sequencing by Hybridization to the shotgun se- 
quencing setting. 

Each colored curve in the feasibility plot is the 
lower boundary of the set of feasible A^, L pairs 
for a specific algorithm. The rightmost curve is 
the one achieved by the greedy algorithm, which 
merges reads with largest overlaps first (used for 
example in TIGR |25| . CAPS [S], and more re- 
cently SSAKE As seen in Fig. |2] its per- 
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formance curve asymptotes at L 
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length of the longest repeat. De Brujin graph 
based algorithms (e.g. jO] and [22]) take a more 
global view via the construction of a de Brujin 
graph out of all the K-mers of the reads. The 
performance curves of all K-mer graph based al- 
gorithms asymptote at read length L = Lcrit, 
but different algorithms use read information in 
a variety of ways to resolve repeats in the K-mer 
graph and thus have different coverage depth re- 
quirement beyond read length Lcrit- By com- 
bining the ideas from several existing algorithms 
(including [22j, ^) we designed MultiBridg- 
ING, which is very close to the lower bound for 
this dataset. Thus Fig. [2] answers, up to a very 
small gap, the feasibility of assembly for the re- 
peat statistics of hcl9, where successful recon- 
struction is desired with probability 99%. 

We produce similar plots for a dozen or so 
datasets (see supplementary material). For 
datasets where ^interleaved is significantly larger 
than ^triple (the majority of the datasets we 
looked at, including those used in the re- 
cent GAGE assembly algorithm evaluation [23^ ) , 
MultiBridging is near optimal, thus allow- 
ing us to characterize the fundamental limits 
for these repeat statistics (Fig. [9]). On the 
other hand, if ^triple is close to or larger than 
^interleaved) there is a gap between the perfor- 
mance of MultiBridging and the lower bound 
(see for example Fig. [3). The reason for the gap 
is explained in Section 3.4 
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Figure 3: Performance of MultiBridging on P 

MarinUS, where ^triple > ^interleaved- 

An interesting feature of the feasibility plots 
is that for typical repeat statistics exhibited by 
DNA data, the minimum coverage depth is char- 
acterized by a critical phenomenon: If the read 



length L is below Lcrit = ^interleaved, reliable re- 
construction of the DNA sequence is impossible 
no matter what the coverage depth is, but if the 
read length L is slightly above Lcrit, then cov- 
ering the sequence suffices, i.e. c = cjcuN = 1- 
The sharpness of the critical phenomenon is de- 
scribed by the size of the critical window, which 
refers to the range of L over which the transition 
from one regime to the other occurs. For the 
case when MultiBridging is near optimal, the 
width W of the window size can be well approx- 
imated as: 



W 



L 



crit 



where r := 



2r + 1 log 

For the hcl9 dataset, the critical window size 
evaluates to about 19% of Lent- 

In Sections |2] and [3j we discuss the underly- 
ing analysis and algorithm design supporting the 
plots. The curves are all computed from formu- 
las, which are validated by simulations in Sec- 
tion |4| We return in Section [5] to put our con- 
tributions in a broader perspective and discuss 
extensions to the basic framework. All proofs 
can be found in the appendix. 

2 Lower bounds 

In this section we discuss lower bounds, due to 
coverage analysis and certain repeat patterns, 
on the required coverage depth and read length. 
The style of analysis here is continued in Sec- 
tion [3j in which we search for an assembly algo- 
rithm that performs close to the lower bounds. 

2.1 Coverage bound 

Lander and Waterman's coverage analysis [3] 
gives the well known condition for the number 
of reads A'^lw required to cover the entire DNA 
sequence with probability at least 1 — e. In the 
regime when L <^ G, one may make the stan- 
dard assumption that the starting locations of 
the reads follow a Poisson process with rate 
A = N/G, and the number Alw is to a very 
good approximation given by the solution to the 
equation 

G, Alw 



Alw 



L 



log 



(2) 
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The corresponding coverage depth is clw = 
NysmL/G. This is our basehne coverage depth 
against which to compare the coverage depth of 
various algorithms. For each algorithm, we will 
plot 

c _ N 

Clw ^lw ' 

the coverage depth required by that algorithm 
normalized by clw- Note that c is also the ratio 
of the number of reads required by an algo- 
rithm to A'^LW- The requirement c > 1 is due to 
the lower bound on the number of reads obtained 
by the Lander- Waterman coverage condition. 

2.2 Ukkonen's condition 

A second constraint on reads arises from repeats. 
A lower bound on the read length L follows from 
Ukkonen's condition |[2B]: if there are interleaved 
repeats or triple repeats in the sequence of length 
at least L — 1, then the likelihood of observing 
the reads is the same for more than one possible 
DNA sequence and hence correct reconstruction 
is not possible. Fig. [4] shows an example with 
interleaved repeats. (Note that we assume 1— e > 
1/2, so random guessing between equally likely 
sequences is not viable.) 
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Figure 4: The likelihood of observing the reads un- 
der two possible sequences (the green and magenta 
segments swapped) is the same. Here, the two red 
subsequences form a repeat and the two orange sub- 
sequences form another repeat. 

We take a moment to carefully define the var- 
ious types of repeats. Let denote the length-^ 
subsequence of the DNA sequence s starting at 
position t. A repeat of length £ is a subsequence 
appearing twice, at some positions ti, t2 (so = 
sf^) that is maximal (i.e. s(ii — 1) s{t2 — l) and 
s{ti+i) 7^ s{t2+i))- Similarly, a triple repeat of 
length £ is a subsequence appearing three times, 
at positions ti,t2,t^, such that = sf^ = sf^. 



and such that neither of s(ti — 1) = s{t2 — 1) = 
s{t3-l) nor s{ti+e) = s{t2+i) = s{t3+£) hold^ 
A copy is a single one of the instances of the sub- 
sequence's appearances. A pair of repeats refers 
to two repeats, each having two copies. A pair of 
repeats, one at positions ti,t3 with ti < t^ and 
the second at positions t2,tA with t2 < t^, is in- 
terleaved if ti < t2 < ^3 < ti or t2 < ^1 < ^4 < is 
(Fig. |4|. The length of a pair of interleaved re- 
peats is defined to be the length of the shorter 
of the two repeats. 

Ukkonen's condition implies a lower bound on 
the read length, 

L > Lcrit := maxj^interleaved, ^triple} + 1 • 

Here ^interleaved is the length of the longest pair 
of interleaved repeats on the DNA sequence and 
Aripie is the length of the longest triple repeat. 
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Figure 5: A subsequence sf is bridged if and only 
if there exists at least one read which covers at least 
one base on both sides of the subsequence, i.e. the 
read arrives in the preceding length L — £—l interval. 

Ukkonen's condition says that for read lengths 
less than Lcrit; reconstruction is impossible no 
matter what the coverage depth is. But it can be 
generalized to provide a lower bound on the cov- 
erage depth for read lengths greater than Lcrit; 
through the important concept of bridging as 
shown in Figure |5j We observe that in Ukkonen's 
interleaved or triple repeats, the actual length of 
the repeated subsequences is irrelevant; rather, 
to cause confusion it is enough that all the copies 
of the pertinent repeats are unbridged. This 
leads to the following theorem. 

Theorem 1. Given a DNA sequence s and a 
set of reads, if there is a pair of interleaved re- 
peats or a triple repeat whose copies are all un- 
bridged, then there is another sequence s' of the 
same length under which the likelihood of observ- 
ing the reads is the same. 

^Note that a subsequence that is repeated / times gives 
rise to (j) repeats and (g) triple repeats. 
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For brevity, we will call a repeat or a triple re- 
peat bridged if at least one copy of the repeat is 
bridged, and a pair of interleaved repeats bridged 
if at least one of the repeats is bridged. Thus, 
the above theorem says that a necessary condi- 
tion for reconstruction is that all interleaved and 
triple repeats are bridged. 

How does Theorem [T] imply a lower bound on 
the coverage depth? Focus on the longest pair of 
interleaved repeats and suppose the read length 
L is between the lengths of the shorter and the 
longer repeats. The probability this pair is un- 
bridged is (p™''''^'^^*^'^ )2 ^J^gj-g 

length subseq. is unbridged] 



unbridged 



gf(L-£-l)+_ 



(3) 

Theorem [T] implies that the probability of mak- 
ing an error in the reconstruction is at least 1/2 
if this event occurs. Hence, the requirement that 
-ferror ^ £ implies a lower bound on the number 
of reads N: 

N > . (4) 

(L - interleaved " 1) ln(l/(2e)) 

A similar lower bound can be derived using the 
longest triple repeat. A slightly tighter lower 
bound can be obtained by taking into considera- 
tion the bridging of all the interleaved and triple 
repeats, not only the longest one, resulting in the 
black curve in Fig. |2j 

3 Towards optimal assembly 

We now begin our search for algorithms perform- 
ing close to the lower bounds derived in the pre- 
vious section. Algorithm assessment begins with 
obtaining deterministic sufficient conditions for 
success in terms of repeat-bridging. We then 
find the necessary and L in order to satisfy 
these sufficient conditions with a target proba- 
bility 1 — e. The required coverage depth for 
each algorithm depends only on certain repeat 
statistics extracted from the DNA data, which 
may be thought of as sufficient statistics. 



lows. Starting with the initial set of reads, the 
two fragments (i.e. subsequences) with maxi- 
mum length overlap are merged, and this opera- 
tion is repeated until a single fragment remains. 
Here the overlap of two fragments x,y is a suf- 
fix of x equal to a prefix of y, and merging two 
fragments results in a single longer fragment. 

Theorem 2. Greedy reconstructs the original 
sequence s if every repeat is bridged. 

Theorem [2] allows us to determine the coverage 
depth required by Greedy: we must ensure that 
all repeats are bridged. By the union bound, 

P[some repeat is unbridged] < ^ a„ (p^^^^^^s'^'^) ' 

m 

(5) 

where pj^^''^'^^*^'^ is defined in ^ and is the 
number of repeats of length m. Setting the right- 
hand side of ([5]) to e ensures Porror < e and yields 
the performance curve of Greedy in Fig. |2} 
Note that the repeat statistics {am} are sufficient 
to compute this curve. 

Greedy requires L > Repeat + 1, whereas 
the lower bound has its asymptote at L = 
interleaved + 1- In chromosome 19, for instance, 
there is a large difference between ^interleaved = 
2248 and iepeat — 4092, and in Fig [2] we see 
a correspondingly large gap. Greedy is evi- 
dently sub-optimal in handling interleaved re- 
peats. Its strength, however, is that once the 
reads are slightly longer than Repeat, coverage of 
the sequence is sufficient for correct reconstruc- 
tion. Thus if Repeat ~ interleaved, then GREEDY 

is close to optimal. 

3.2 K-mer algorithms 

The greedy algorithm fails when there are un- 
bridged repeats, even if there are no unbridged 
interleaved repeats, and therefore requires a read 
length much longer than that required by Ukko- 
nen's condition. As we will see, K-mei algo- 
rithms do not have this limitation. 



3.1 Greedy algorithm 

The greedy algorithm, denoted Greedy, with 



pseudocode in section C.l is described as fol- 



3.2.1 Background 

In the introduction we mention Sequencing By 
Hybridization (SBH), for which Ukkonen's con- 
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dition was originally introduced. In the SBH set- 
ting, an optimal algorithm matching Ukkonen's 
condition is known, due to Pevzner [21] . 

Pevzner's algorithm is based on finding an ap- 
propriate cycle in a i^-mer graph (also known 
as a de Bruijn graph) with K = L — 1 (see e.g. 
m for an overview). A K-mei graph is formed 
by first creating a node in the graph for each 
unique K-mer (length K subsequence) in the set 
of reads, and then adding an edge with overlap 
K — 1 between any two nodes representing K- 
mers that are adjacent in a read, i.e. offset by 
a single nucleotide. Edges thus correspond to 
unique {K + l)-mers in s and paths correspond 
to longer subsequences obtained by merging the 
constituent nodes. There exists a cycle corre- 
sponding to the original sequence s, and recon- 
struction entails finding this cycle. 

As is common, we will replace edges corre- 
sponding to an unambiguous path by a single 
node (c.f. Fig. [6|. Since the subsequences at 
some nodes are now longer than K, this is no 
longer a K-mei graph, and we call the more 
general graph a sequence graph. The simplified 
graph is called the condensed sequence graph. 

Figure 6: Contracting an edge by merging the in- 
cident nodes. Repeating this operation results in the 
condensed graph. 

The condensed graph has the useful property 
that if the original sequence s is reconstructible, 
then s is determined by a unique Eulerian cycle: 

Theorem 3. Let Go he the K-mer graph con- 
structed from the {K + l)-spectrum Sk+i of s, 
and let G be the condensed sequence graph ob- 
tained from Go- // Ukkonen's condition is satis- 
fied, i.e. there are no triple or interleaved repeats 
of length at least K, then there is a unique Eu- 
lerian cycle C in G and C corresponds to s. 

Theorem [3] characterizes, deterministically, 
the values of K for which reconstruction from 
the {K + l)-spectrum is possible. We proceed 
with application of the K-mer graph approach 
to shotgun sequencing data. 




3.2.2 Basic i^-mer algorithm 

Starting with Idury and Waterman [6_ , and then 
Pevzner et al.'s [22] euler algorithm, most cur- 
rent assembly algorithms for shotgun sequencing 
are based on the K-mev graph. Idury and Wa- 
terman [6] made the key observation that SBH 
with subsequences of length K -\- 1 can be em- 
ulated by shotgun sequencing if each read over- 
laps the subsequent read by K: the set of all 
{K -\- l)-mers within the reads is equal to the 
(i^-l-l)-spectrum Sk+i- The resultant algorithm 
DeBruijn which consists of constructing the K- 
mer graph from the (-fC-l-l)-spectrum observed in 
the reads, condensing the graph, and then identi- 
fying an Eulerian cycle, has sufficient conditions 
for correct reconstruction as follows. 

Theorem 4. DeBruijn with parameter choice 
K reconstructs the original sequence s if: 

(a) K > ^interleaved 

(b) K > itriple 

(c) adjacent reads overlap by at least K 

Lander and Waterman's coverage analysis ap- 
plies also to Condition (c) of Theorem |4] yield- 
ing a normalized coverage depth requirement 
c = 1/(1 — K/L). The larger the overlap K, 
the higher the coverage depth required. Condi- 
tions (a) and (b) say that the smallest K one can 

choose is = max{4riple, interleaved} + 1, SO 



(6) 
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The performance of DeBruijn is plotted 
in Fig. |2j DeBruijn significantly improves 
on Greedy by obtaining the correct first 
order performance: given sufficiently many 
reads, the read length L may be decreased to 
max{4ripie, interleaved} + 1- Still, the number of 
reads required to approach this critical length is 
far above the lower bound. The following sub- 
section pursues reducing K in order to reduce 
the required number of reads. 

3.3 Improved K-mer algorithms 

Algorithm DeBruun ignores a lot of informa- 
tion contained in the reads, and indeed all of 
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the K-mei: based algorithms proposed by the se- 
quencing community (including [B] , [22] , , [1| , 
jlU| . [29] ) use the read information to a greater 
extent than the naive DeBruijn algorithm. Bet- 
ter use of the read information, as described be- 
low in algorithms SimpleBridging and Multi- 
Bridging, will allow us to relax the condition 

K > max{£interleaved, triple} for SUCCeSS of DE- 

Bruijn, which in turn reduces the high coverage 
depth required by Condition (c). 

Existing algorithms use read information in a 
variety of distinct ways to resolve repeats. For 
instance, Pevzner et al. [22] observe that for 
graphs where each edge has multiplicity one, if 
one copy of a repeat is bridged, the repeat can be 
resolved through what they call a "detachment". 
The algorithm SimpleBridging described be- 
low is very similar, and resolves repeats with two 
copies if at least one copy is bridged. 

Meanwhile, other algorithms are better suited 
to higher edge multiplicities due to higher order 
repeats; IDBA (Iterative DeBruijn Assembler) 
|19j creates a series of K-mer graphs, each with 
larger K, and at each step uses not just the reads 
to identify adjacent K-meis, but also all the un- 
bridged paths in the K-mei graph with smaller 
K. Although not stated explicitly in their paper, 
we observe here that if all copies of every repeat 
are bridged, then IDBA correctly reconstructs. 

However, it is suboptimal to require that all 
copies of every repeat up to the maximal K be 
bridged. We introduce MultiBridging, which 
combines the aforementioned ideas to simulta- 
neously allow for single-bridged double repeats, 
triple repeats in which all copies are bridged, and 
unbridged non-interleaved repeats. 

3.3.1 SimpleBridging 

SimpleBridging improves on DeBruijn by re- 
solving bridged 2-repeats (i.e. a repeat with ex- 
actly two copies in which at least one copy is 
bridged by a read) . Condition (a) K > ^interleaved 
for success of DeBruijn (ensuring that no inter- 
leaved repeats appear in the initial K-mei graph) 
is updated to require only no unbridged inter- 
leaved repeats, which matches the lower bound. 
With this change. Condition (b) K > ^triple 



forms the bottleneck for typical DNA sequences. 
Thus SimpleBridging is optimal with respect 
to interleaved repeats, but it is suboptimal with 
respect to triple repeats. 

SimpleBridging deals with repeats by per- 
forming surgery on certain nodes in the sequence 
graph. In the sequence graph, a repeat corre- 
sponds to a node we call an X-node, a node with 
in-degree and out-degree each at least two (e.g. 
Fig. [7|. A self-loop adds one each to the in- 
degree and out-degree. The cycle C(s) traverses 
each X-node at least twice, so X-nodes corre- 
spond to repeats in s. We call an X-node tra- 
versed exactly twice a 2-X-node; these nodes cor- 
respond to 2-repeats, and are said to be bridged 
if the corresponding repeat in s is bridged. 

In the repeat resolution step of SimpleBridg- 
ing (illustrated in Fig. [7|, bridged 2-X-nodesare 
duplicated in the graph and incoming and out- 
going edges are inferred using the bridging read, 
reducing possible ambiguity. 
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Figure 7: An example of the bridging step in Sim- 
pleBridging. 

Theorem 5. SimpleBridging with parameter 
choice K reconstructs the original sequence s if: 

(a) all interleaved repeats are bridged 

(b) K > itriple 

(c) adjacent reads overlap by at least K. 
By the union bound, 

P[some interleaved repeat is unbridged] 

< 



where hm,n is the number of interleaved repeats 
in which one repeat is of length m and the other 
is of length n. To ensure that condition (a) 
in the above theorem fails with probability no 
more than e, the right hand side of ([T]) is set 
to be e; this imposes a constraint on the cov- 
erage depth. Furthermore, conditions (b) and 
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(c) imply that the normahzed coverage depth 
c > 1/(1 — (triple + 1)/-^) • These two constraints 
together yield the performance curve of Simple- 
Bridging in Figure [2j 

3.3.2 MultiBridging 

We now turn to triple repeats. As previously ob- 
served, it can be challenging to resolve repeats 
with more than one copy [22], because an edge 
into the repeat may be paired with more than 
one outgoing edge. As discussed above, our ap- 
proach here shares elements with IDEA [12] : we 
note that increasing the node length serves to re- 
solve repeats. Unlike IDBA, we do not increase 
the node length globally. 

As noted in the previous subsection, repeats 
correspond to nodes in the sequence graph we 
call X-nodes. Here the converse is false: not all 
repeats correspond to X-nodes. A repeat is said 
to be all-bridged if all repeat copies are bridged, 
and an X-node is called all-bridged if the corre- 
sponding repeat is all-bridged. 



..AATTGCAAG.. 



..GATTGCAACG- 



..CATTGCAACT... 



CAAG 



AATT 



GATT 



CATT 




CATTGCAA 



Figure 8: MultiBridging resolves an X-node with 
label ATTGCAA corresponding to a triple repeat. 

The requirement that triple repeats be all- 
bridged allows them to be resolved locally 
(Fig. [8|. The X-node resolution procedure given 
in Step 4 of MultiBridging can be interpreted 
in the i^T-mer graph framework as increasing K 
locally so that repeats do not appear in the 
graph. In order to do this, we introduce the 
following notation for extending nodes: Given 



an edge (v,q) with weight av,q, let v"^'^ de- 
note V extended one base to the right along 

_i (notation introduced 



(v,qj. I.e. 
in Sec. 



2.2). Similarly, let 



Pcnd— Opv ^ ' 



MultiBridging is described as follows. 

Algorithm 1 MultiBridging. Input: reads 
7^, parameter K. Output: sequence s. 
K-mer steps 1-3: 

1. For each subsequence x of length K m.a read, 
form a node with label x. 

2. For each read, add edges between nodes rep- 
resenting adjacent X-mers in the read. 

3. Condense the graph (c.f. Fig. [6]). 

4. Bridging step: (See Fig. |8]). While there ex- 
ists a bridged X-node v: (i) For each edge (pj, v) 
with weight Op-^v, create a new node Uj = p*~^v 
and an edge (pj,Uj) with weight 1 -|- ap.^v Sim- 
ilarly for each edge (v,qj), create a new node 



w 



and edge (wj,qj). (ii) If v has a 
self-loop (v, v) with weight av,v) add an edge 
(v^^,'^^v) with weight av,v + 2. (iii) Remove 
node V and all incident edges, (iv) For each pair 
Uj,Wj adjacent in a read, add edge (uj,Wj). If 
exactly one each of the Uj and Wj nodes have no 
added edge, add the edge, (v) Condense graph. 
5. Finishing step: Find an Eulerian cycle in the 
graph and return the corresponding sequence. 

Theorem 6. The algorithm MultiBridging 

reconstructs the sequence s if: 

(a) all interleaved repeats are bridged 

(b) all triple repeats are all-bridged 

(c) the sequence is covered by the reads. 

A similar analysis as for SimpleBridging 
yields the performance curve of MultiBridg- 
ing in Figure [2] 

3.4 Gap to lower bound 

The only difference between the sufficient condi- 
tion guaranteeing the success of MultiBridg- 
ing and the necessary condition of the lower 
bound is the bridging condition of triple re- 
peats: while MultiBridging requires bridging 
all three copies of the triple repeats, the nec- 
essary condition requires only bridging a sin- 
gle copy. When ^triple is significantly smaller 
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than ^interleaved) the bridging requirement of in- 
terleaved repeats dominates over that of triple 
repeats and MultiBridging achieves very close 
to the lower bound. This occurs in hcl9 and the 
majority of the datasets we looked at. (See Fig. [9] 
and the plots in the supplementary material.) A 
critical phenomenon occurs as L increases: for 
L < Lcrit reconstruction is impossible, over a 
small critical window the bridging requirement of 
interleaved repeats (primarily the longest) dom- 
inates, and then for larger L, coverage suffices. 

On the other hand, when ^triple is compara- 
ble or larger than interleaved, then MultiBridg- 
ing has a gap in the coverage depth to the lower 
bound (see for example Fig.|3]). If we further as- 
sume that the longest triple repeat is dominant, 
then this gap can be calculated to be a factor of 

3 • ^ 3.72 for e = lO^^. This gap occurs 
only within the critical window where the repeat- 
bridging constraint is active. Beyond the critical 
window, the coverage constraint dominates and 
MultiBridging is optimal. Further details are 
provided in the appendices. 

4 Simulations and complexity 

In order to verify performance predictions, we 
implemented and ran the algorithms on simu- 
lated error- free reads from sequenced genomes. 
For each algorithm, we sampled (A'^, L) points 
predicted to give < 5% error, and recorded 
the number of times correct reconstruction was 
achieved out of 100 trials. Fig. [9] shows results 
for the three GAGE reference sequences. 

We now estimate the run-time of MultiB- 
ridging. The algorithm has two phases: the K- 
mer graph formation step, and the repeat resolu- 
tion step. The K-mei graph formation runtime 
can be easily bounded by 0{{L—K)NK), assum- 
ing 0{K) look-up time for each of the (L — K)N 
K-meis observed in reads. This step is common 
to all K-mer graph based algorithms, so previ- 
ous works to decrease the practical runtime or 
memory requirements are applicable. 

The repeat resolution step depends on the 
repeat statistics and choice of K. It can be 
loosely bounded as 0( J2f'=K LJ2'^^^ repeats x dx)- 

V of length e / 



The second sum is over distinct maximal re- 
peats X of length £ and dx is the number of (not 
necessarily maximal) copies of repeat x. The 
bound comes from the fact that each maximal 
repeat of length K < I < L is resolved via ex- 
actly one bridged X-node, and each such reso- 
lution requires examining at most the Ldx dis- 
tinct reads that contain the repeat. We note 

that y ] (} ^ L y ] max repeats x dx ^ L ^ ] ^ ^ CL£ , aud 

of length t 

the latter quantity is easily computable from our 
sufficient statistics. 

For our data sets, with appropriate choice of 
K, the bridging step is much simpler than the K- 
mer graph formation step: for R. sphaeroides we 
use K = 40 to get J2e=K = 412; in contrast, 
N > 22421 for the relevant range of L. Similarly, 
for hcl4, using K = 300, T,f=K L<^t = 661 while 
N > 733550; for S. Aureus, J2f=K Lae = 558 
while iV > 8031. 

5 Discussions and extensions 

The notion of optimal shotgun assembly is not 
commonly discussed in the literature. One rea- 
son is that there is no universally agreed-upon 
metric of success. Another reason is that most 
of the optimization-based formulations of assem- 
bly have been shown to be NP-hard, includ- 
ing Shortest Common Superstring [3], [7], De 
Bruijn Superwalk [22], [12], and Minimum s- 
Walk on the string graph [TT, \T2\. Thus, it 
would seem that optimal assembly algorithms 
are out of the question from a computational per- 
spective. What we show in this paper is that if 
the goal is complete reconstruction, then one can 
define a clear notion of optimality, and moreover 
there is a computationally efficient assembly al- 
gorithm (MultiBridging) that is near optimal 
for a wide range of DNA repeat statistics. So 
while the reconstruction problem may well be 
NP-hard, typical instances of the problem seem 
much easier than the worst possibility al- 

ready suggested by Nagarajan and Pop [17i. 

The MultiBridging algorithm is near opti- 
mal in the sense that, for a wide range of repeat 
statistics, it requires the minimum read length 
and minimum coverage depth to achieve com- 
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read length L read length L read length L 



(a) S. Aureus (b) R. sphaeroides (c) hcl4 

Figure 9: Simulation results for each of the GAGE reference genomes. Each simulated {N, L) point 
is marked with the number of correct reconstructions (e.g. 93, 98, 95) on 100 simulated read sets. All 
four algorithms (Greedy, DeBruijn, SimpleBridging, and MultiBridging) were run on S. Aureus, R. 
sphaeroides and hcl4. Note that MultiBridging is very close to the lower bound on all 3 datasets. 



plete reconstruction. However, since the repeat 
statistics of a genome to be sequenced are usu- 
ally not known in advance, this minimum re- 
quired read length and minimum required cov- 
erage depth may also not be known in advance. 
In this context, it would be useful for the Multi- 
Bridging algorithm to validate whether its as- 
sembly is correct. More generally, an interesting 
question is to seek algorithms which are not only 
optimal in their data requirements but also pro- 
vide a measure of confidence in their assemblies. 

How realistic is the goal of complete recon- 
struction given current-day sequencing technolo- 
gies? The minimum read lengths Lcrit required 
for complete reconstruction on the datasets we 
examined are typically on the order of 500 — 
3000 base pairs (bp). This is substantially 
longer than the reads produced by Illumina, the 
current dominant sequencing technology, which 
produces reads of lengths 100-200bp; however, 
other technologies produce longer reads. PacBio 
reads can be as long as several thousand base 
pairs, and as demonstrated by [8j, the noise can 
be cleaned by Illumina reads to enable near- 
complete reconstruction. Thus our framework is 
already relevant to some of the current cutting 
edge technologies. To make our framework more 
relevant to short-read technologies such as Illu- 
mina, an important direction is to incorporate 
mate-pairs in the read model, which can help to 
resolve long repeats with short reads. Other ex- 
tensions to the basic shotgun sequencing model: 



heterogenous read lengths: This occurs in 
some technologies where the read length is ran- 
dom (e.g. Pacbio) or when reads from multi- 
ple technologies are used. Generalized Ukko- 
nen's conditions and the sufficient conditions of 
MultiBridging extend verbatim to this case, 
and only the computation of the bridging prob- 
ability ^ has to be slightly modified, 
non-uniform read coverage: Again, only the 
computation of the bridging probability has to 
be modified. One issue of interest is to investi- 
gate whether reads are sampled less frequently 
from long repeat regions. If so, our framework 
can quantify the performance hit. 
double strand: DNA is double-stranded and 
consists of a length-G sequence u and its re- 
verse complement u. Each read is either sampled 
from u or u. This more realistic scenario can be 
mapped into our single-strand model by defin- 
ing s as the length-2G concatenation of u and u, 
transforming each read into itself and its reverse 
complement so that there are 2N reads. General- 
ized Ukkonen's conditions hold verbatim for this 
problem, and MultiBridging can be applied, 
with the slight modification that instead of look- 
ing for a single Eulerian path, it should look for 
two Eulerian paths, one for each component of 
the sequence graph after repeat-resolution. An 
interesting aspect of this model is that, in addi- 
tion to interleaved repeats on the single strand u, 
reverse complement repeats on u will also induce 
interleaved repeats on the sequence s. 
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A Supplementary Material 



In this supplementary material, we display the 
output of our pipeline for 9 datasets (in addition 
to hcl9, whose output is in the introduction, and 
the GAGE datasets R. sphaeroides, S. Aureus, 
and hcl4). For each dataset we plot 

log(l + a^), 

the log of one plus the number of repeats of each 
length I. Prom the repeat statistics 

Qto) bjji ji^ and 

Cm, we produce a feasibility plot. The thick black 
line denotes the lower bound on feasible N, L, 
and the green line is the performance achieved 
by MultiBridging. 
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Figure 10: Lactofidus. G = 2,078,001, triple = 3027, interleaved = 3313, 4opoat 
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Figure 11: Buchnera. G = 642, 122, ^triple = 27, interleaved = 23, Repeat = 39. 
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Figure 12: HcliSl. G = 1,589,954, i^pie = 219, Wieaved = 2122, Repeat 



= 3478. 
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Figure 13: Salmonella. G = 2,215,568, triple = 112, Antorioavcd = 163, Repeat = 1011. 
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Figure 14: Pcrkinsus marinus. G = 1,440,372, ^Hpio = 770, t 
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= 92, Repeat - 1784. 
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Figure 15: Sulfolobus islandicus. G = 2, 655, 198, triple = 734, ^interleaved = 761, Repeat = 875. 
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Figure 16: Ecoli536. G = 4,938,920, triple = 2267, ^interleaved = 3245, Repeat = 3353. 
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Figure 17: Ycsnina. G = 4,504,254, triple - 3573, interleaved = 3627, Repeat = 5358. 
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B Lower bounds on coverage 
depth 



The lower bounds are based on a generalization 
of Ukkonen's condition to shotgun sequencing, as 
described in Theorem [T| The proof of Theorem [T] 
follows by a straightforward modification to the 
argument in \26\ and is omitted here. 



Theorem [TJ Given a DNA sequence s and a 
set of reads, if there is a pair of interleaved re- 
peats or a triple repeat whose copies are all un- 
hridged, then there is another sequence s' of the 
same length under which the likelihood of observ- 
ing the reads is the same. 



B.l Lower bound due to interleaved 
repeats 



In this section we derive a necessary condition on 
N and L in order that the probability of correct 
reconstruction be at least 1 — e. 

Recall that a pair of repeats, one at positions 
ti,t3 with ti < t^ and the second at positions 
t2,ti with t2 < ^4, is interleaved if ti < t2 < 
ta < t4 or t2 < ii < ^4 < is- From the DNA 
we may extract a (symmetric) matrix of inter- 
leaved repeat statistics bmn, the number of pairs 
of interleaved repeats of lengths m and n. 

We proceed by fixing both N and L and check- 
ing whether or not unbridged interleaved re- 
peats occur with probability higher than e. We 



will break up repeats into 2 categories: repeats 
of length at least L — 1 (these are always un- 
bridged), and repeats of length less than L — 1 
(these are sometimes unbridged). We assume 
that L > ^interleaved + 1; or equivalently bij = 
for all i,j > L — 1, since otherwise there are 
(with certainty) unbridged interleaved repeats 
and Ukkonen's condition is violated. 

First, we estimate the probability of error due 
to interleaved repeats of lengths i < L — 1 and 
j > L—1. The repeat of length j is too long to be 
bridged, so an error occurs if the repeat of length 
i is unbridged. For a repeat, as long as the two 
copies' locations are not too nearbjQ each copy 
is bridged independently and hence the proba- 
bility that both copies of the repeat of length i 
are unbridged is p^^^^ridged ^ ^_2g(L-i-i)_ 

call that a repeat is unbridged if both copies are 
unbridged.) 

A union bound estimat^ gives a probability 
of error 

1 - -2A(L-m-l)_ (^g^ 



2 S 

m<L-l 
n>L-l 



b„ 



Requiring the error probability to be less than e 
and solving for L gives the necessary condition 

where 71 := X]™<i-i bmn.e'^^^^^^^"^^^^ is a simple 

n>L-l 

function of the interleaved repeat statistic bmn- 
We now estimate the probability of error due 
to interleaved repeat pairs in which both repeats 
are shorter than L — 1. In this case only one 
repeat of each interleaved repeat pair must be 
bridged. Again a union bound estimate gives 

1 



E 



L -2A(L-m-l) -2A(L-n-l) 



m,n<L—l 



*More precisely, for the two copies of a a repeat of 
length £ to be bridged independently requires that no sin- 
gle read can bridge them both. This means their locations 
t and t + d must have separation d> L — — 2. 

^The union bound on probabilities gives an upper 
bound, so its use here is only an approximation. To get a 
rigorous lower bound we can use the inclusion-exclusion 
principle, but the difference in the two computations is 
negligible for the data we observed. For ease of exposi- 
tion we opt to present the simpler union bound estimate. 
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Requiring the error probability to be less than e 
gives the necessary condition 



where 72 := Em,«L-i ftmne^'"'""'"*"*^' and 

similarly to 71 is computed from bmn- 



B.2 Lower bound due to triple repeats 

We translate the generalized Ukkonen's condi- 
tion prohibiting unbridged triple repeats into a 
condition on L and N. Let Cm denote the num- 
ber of triple repeats of length m. Then a union 
bound estimate gives 



1 



-3A(L-m-l) 



Requiring < e and solving for L gives 



L > ^log^ = — log^, (12) 
- 3A ^ 2e 3iV ^ 2e ' ^ ' 



where 73 := Em Cme3(^/^)('"+i). 

Remark 7. As discussed here and in Section |2] if 
the DNA sequence is not covered by the reads or 
there are unbridged interleaved or triple repeats, 
then reconstruction is not possible. But there 
is another situation which must be ruled out. 
Without knowing its length a priori, it is im- 
possible to know how many copies of the DNA 
sequence are actually present: if the sequence 
s to be assembled consists of multiple concate- 
nated copies of a shorter sequence, rather than 
just one copy, the probability of observing any 
set of reads will be the same. Since it is un- 
likely that a true DNA sequence will consist of 
the same sequence repeated multiple times, we 
assume this is not the case throughout the pa- 
per. Equivalently, if s does consist of multiple 
concatenated copies of a shorter sequence, we are 
content to reconstruct a single copy. If available, 
knowledge of the approximate length of s would 
then allow to reconstruct. 



C Proofs for algorithms 

C.l Proof of Theorem [2] (Greedy) 

The greedy algorithm's underlying data struc- 
ture is the overlap graph, where each node rep- 
resents a read and each (directed) edge (x, y) is 
labeled with the overlap ov(x, y) (defined as the 
the length of the shared prefix/suffix) between 
the incident nodes' reads. For a node v, the 
in-degree [out-degree] is the number of edges in 
the graph directed towards [away from] v. The 
greedy algorithm is described as follows. 

Algorithm 2 Greedy. Input: reads IZ. Out- 
put: sequence s. 

1. For each read with sequence x, form a node 
with label x. 

Greedy steps 2-3: 

2. Consider all pairs of nodes xi,X2 in G satis- 
fying (iout(xi) = (iin(x2) = 0, and add an edge 
(xi,X2) with largest value ov(xi,X2). 

3. Repeat Step 2 until no candidate pair of nodes 
remains. 

Finishing step: 

4. Output the sequence corresponding to the 
unique cycle in G. 

Theorem [2j Given a sequence s and a set 
of reads, Greedy returns s if every repeat is 
bridged. 

Proof. We prove the contrapositive. Suppose 
Greedy makes its first error in merging reads 
Yi and Yj with overlap ov(rj,rj) = 1. Now, if 
Yj is the successor to r^, then the error is due 
to incorrectly aligning the reads; the other case 
is that Yj is not the successor of rj. In the first 
case, the subsequence sf_^. is repeated at location 
s^._i_2,_£, and no read bridges either repeat copy. 
In the second case, there is a repeat s^. = 

^t-+L-t If ^t +L-l is bridged by some read r^, 
then Yi has overlap at least (. + 1 with r^, imply- 
ing that read rj has already found its successor 
before step £ (either r^ or some other read with 
even higher overlap). A similar argument shows 
that s^. cannot be bridged, hence there is an un- 
bridged repeat. □ 
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C.2 Proofs for i^-mer algorithms 
C.2.1 Background 

We give some mathematical background leading 
to the proof of Theorem [s] (restated below). 

Lemma 8. Fix an arbitrary K and form the K- 
mer graph from the {K + 1)- spectrum Sk+i- The 
sequence s corresponds to a unique cycle C(s) 
traversing each edge at least once. 

To prove the lemma, note that all (i^+l)-mers 
in s correspond to edges and adjacent [K + 1)- 
mers in s are represented by adjacent edges. An 
induction argument shows that s corresponds to 
a cycle. The cycle traverses all the edges, since 
each edge represents a unique {K + l)-mer in s. 

In both SBH and shotgun sequencing the num- 
ber of times each edge e is traversed by C(s) 
(henceforth called the multiplicity of e) is un- 
known a priori, and finding this number is part 
of the reconstruction task. Repeated {K + 1)- 
mers in s correspond to edges in the K-m.ei graph 
traversed more than once by C(s), i.e. having 
multiplicity greater than one. In order to esti- 
mate the multiplicity, previous works seek a so- 
lution to the so-called Chinese Postman Problem 
(CPP), in which the goal is to find a cycle of the 
shortest total length traversing every edge in the 
graph (see e.g. ED], [B], [22], [H])- It is not 
obvious under what conditions the CPP solution 
correctly assigns multiplicities in agreement with 
C(s). For our purposes, as we will see in Theo- 
rem |3] the multiplicity estimation problem can 
be sidestepped (thereby avoiding solving CPP) 
through a modification to the iC-mer graph. 

Ignoring the issue of edge multiplicities for a 
moment, Pevzner [2T] showed for the SBH model 
that if the edge multiplicities are known with 
multiple copies of each edge included according 
to the multiplicities, and moreover Ukkonen's 
condition is satisfied, then there is a unique Eu- 
lerian cycle in the ET-mer graph and the Eulerian 
cycle corresponds to the original sequence. (An 
Eulerian cycle is a cycle traversing each edge ex- 
actly once.) Pevzner's algorithm is thus to find 
an Eulerian cycle and read off the corresponding 
sequence. Both steps can be done efficiently. 



Lemma 9 (Pevzner |2lJ). In the SBH setting, if 
the edge multiplicities are known, then there is a 
unique Eulerian cycle in the K-mer graph with 
K = L — 1 if and only if there are no unbridged 
interleaved repeats or unbridged triple repeats. 

Most practical algorithms (e.g. [B], (TUj, [2U] ) 
condense unambiguous paths (called unitigs by 
Myers [15] in a slightly different setting) for com- 
putational efficiency. The more significant ben- 
efit for us, as shown in Theorem [3j is that if 
Ukkonen's condition is satisfied then condensing 
the graph obviates the need to estimate multi- 
plicities. Condensing a i^-mer graph results in a 
graph of the following type. 

Definition 10 (Sequence graph). A sequence 
graph is a graph in which each node is labeled 
with a subsequence, and edges (u, v) are labeled 
with an overlap a^v such the subsequences u and 
V overlap by a^v (the overlap is not necessarily 
maximal). In other words, an edge label Ouv on 
e = (u, v) indicates that the Ouv-length suffix of 
u is equal to the Ouv-length prefix of v. 

The sequence graph generalizes both the over- 



lap graph used by Greedy in Section 3.1 (nodes 



correspond to reads, and edge overlaps are max- 
imal overlaps) as well as the K-mei algorithms 
discussed in this section (nodes correspond to K- 
mers, and edge overlaps are K — 1). 

In order to speak concisely about concatenated 
sequences in the sequence graph, we extend the 
notation sf (denoting the length-^ subsequence 
of the DNA sequence s starting at position t) 
which was introduced in Section 12.21 we abuse 



notation slightly, and write sf'^'^ to indicate the 
subsequence of s starting at position t and having 
length so that its end coincides with the end of 
s. 

We will perform two basic operations on the 
sequence graph. For an edge e = (u,v) with 
overlap a^v, merging u and v along e produces 

Contracting an 



the concatenation u?'^ v' 



cnd-_cnd 



1 

edge e = (u, v) entails two steps (c.f. Fig. |6]): 
first, merging u and v along e to form a new 
node w = uf'^'^vjj"^^^, and, second, edges to u 
are replaced with edges to w, and edges from 
V are replaced by edges from w. We will only 
contract edges (u,v) with (iout(u) = din(v) = 1. 
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The condensed graph is defined next. 

Definition 11 (Condensed sequence graph). 
The condensed sequence graph replaces unam- 
biguous paths by single nodes. Concretely, any 
edge e = {u,v) with dout(^*) = '^in(^') = 1 is con- 
tracted, and this is repeated until no candidate 
edges remain. 

For a path V = vi,V2,...,Vq in the origi- 
nal graph, the corresponding path in the con- 
densed graph is obtained by contracting an edge 
(vj, Vj_|_i) whenever it is contracted in the graph, 
replacing the node vi by w whenever an edge 
(u, vi) is contracted to form w, and similarly for 
the final node Vg. It is impossible for an inter- 
mediate node Vj, 2 < i < g, to be merged with a 
node outside of as this would violate the con- 
dition dontiu) = dia{v) = 1 for edge contraction 
in Defn. [TTl 

In the condensed sequence graph G obtained 
from a sequence s, nodes correspond to subse- 
quences via their labels, and paths in G cor- 
respond to subsequences in s via merging the 
constituent nodes along the path. If the subse- 
quence corresponding to a node v appears twice 
or more in s, we say that v corresponds to a re- 
peat. Conversely, subsequences of length i > K 
in s correspond to paths V of length £ — K + 1 
in the i^-mer graph, and thus by the previous 
paragraph also to paths in the condensed graph 
G. 

We record a few simple facts about the con- 
densed sequence graph obtained from a K-mev 
graph. 

Lemma 12. Let Go be the K-mer graph con- 
structed from the {K + \)-spectrum of s and let 
Co = Co(s) he the cycle corresponding to s. In the 
condensed graph G, let C be the cycle obtained 
from Co by contracting the same edges as those 
contracted in Gq. 

1. Edges in Gq can be contracted in any order, 
resulting in the same graph G, so the con- 
densed graph is well-defined. Similarly C is 
well-defined. 

2. The cycle C in G corresponds to s and is the 
unique such cycle. 

3. The cycle C in G traverses each edge at least 
once. 



Theorem |3| Let Sk+i be the [K + \)- spectrum 
ofs and Gq be the K-mer graph constructed from 
Sk+1, and let G be the condensed sequence graph 
obtained from Gq. // Ukkonen's condition is sat- 
isfied, i.e. there are no triple repeats or inter- 
leaved repeats of length at least K, then there is 
a unique Eulerian cycle C in G and C corresponds 
to s. 

Proof. We will show that if Ukkonen's condition 
is satisfied, the cycle C = C(s) in G correspond- 



ing to s (constructed in Lemma 12 ) traverses 
each edge exactly once in the condensed K-meT 
graph, i.e. C is Eulerian. Pevzner's [21] argu- 
ments show that if there are multiple Eulerian 
cycles then Ukkonen's condition is violated, so it 
is sufficient to prove that C is Eulerian. As noted 



in Lemma 12 C traverses each edge at least once. 



and thus it remains only to show that C traverses 
each edge at most once. 

To begin, let Cq be the cycle corresponding to 
s in the original K-mer graph Gq. We argue 
that every edge (u,v) traversed twice by Cq in 
the K-mei graph Gq has been contracted in the 
condensed graph G and hence in C Note that the 
cycle Cq does not traverse any node three times in 
Go, for this would imply the existence of a triple 
repeat of length K, violating the hypothesis of 
the Lemma. It follows that the node u cannot 
have two outgoing edges in Go as u would then 
be traversed three times; similarly, v cannot have 
two incoming edges. Thus (iout(u) = din(v) = 1 



and, as prescribed in Defn. 11, the edge (u,v) 
has been contracted. □ 

C.2.2 Proofs for SimpleBridging 

Since bridging reads extend one base to either 
end of a repeat, it will be convenient to use 
the following notation for extending sequences: 
Given an X-node v with an incoming edge (p, v) 
and an outgoing edge (v, q), let 



1+1' 



and 



Pcnd- 



V . 



(13) 

Here v"^'' denotes the subsequence v appended 
with the single next base in the merging of v 
and q and p^v the subsequence v prepended 
with the single previous base in the merging of p 
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and V. For example, if v 

flpv 



ATTC, p = TCAT, 
2, q = TTCGCC, and 

flvq — 3, then 
v^q = ATTCG, P^v = CATTC, and p^v^^i = 
CATTCG. 

The idea is that a bridging read is consistent 
with only one pair p~^v and v^'' and thus al- 
lows to match up edge (p,v) with (v,q). This 
is recorded in the following lemma. 

Lemma 13. Suppose C corresponds to a se- 
quence s in a condensed sequence graph G. If a 
read r bridges an X-node v, then there are unique 
edges (p, v) and (v, q) such that P~^v and v^*' 
are adjacent in r. 

SimpleBridging is described as follows. 

Algorithm 3 SimpleBridging. Input: reads 
TZ, parameter K. Output: sequence s. 
K-mer steps 1-3: 

1. For each subsequence x of length in a read, 
form a node with label x. 

2. For each read, add edges between nodes rep- 
resenting adjacent i^T-mers in the read. 



3. Condense the graph as described in Defn. 11 



4. Bridging step: See Fig. [7j While there exists 
an X-node v with din(v) = dout(v) = 2 bridged 
by some read r: (i) Remove v and edges incident 
to it. Add duplicate nodes vi,V2. (ii) Choose 
the unique pj and qj s.t. P*~^v and are ad- 
jacent in r and add edges (pi,vi) and (vi,qj). 
Choose the unused pj and q^, add edges (pj, V2) 
and (v2,qj). (iii) Condense the graph. 

5. Finishing step: Find an Eulerian cycle in the 
graph and return the corresponding sequence. 

C.2.3 Proofs for MultiBridging 

In this subsection we recall Theorem |6] stating 
sufficient conditions for correct reconstruction, 
and derive the corresponding required coverage 
depth and read length to meet a target proba- 
bility of correct reconstruction. The subsection 
concludes with a proof that the sufficient condi- 
tions are correct. 

Theorem [6]. The algorithm MultiBridging 
reconstructs the sequence s if: 
(a) all interleaved repeats are bridged 



GATTGCATTG 
I 1 

..GATTGCATTGCATTC. 

I 1 

CATTGCATTC 



TATTGCATTT.. 



GATT 



^« ATTC 




GATT 



CATTGCATT 



GATTGCATT 



TATTGCATT 



ATTGCATTG 



ATTGCATTC 



ATTGCATTT 



GATT 




GATTGCATT 



ATTGCATTC 



TATTGCATT 



ATTGCATTT 



Figure 18: Resolution of X-node with a self-loop. 

(b) all triple repeats are all-bridged 

(c) the sequence is covered by the reads. 

Remark 14. Unlike the previous K-mev algo- 
rithms, DeBruijn and SimpleBridging, it 
is unnecessary to specify a parameter K for 
MultiBridging. Implicitly MultiBridging 
uses K = 1, which makes the condition that 
reads overlap by K equivalent to coverage of the 
genome. 

Figure |2] plots the performance of MultiB- 
ridging, obtained by solving for the relationship 
between G, N, L, and e in order to satisfy the 
conditions of Lemma [6] We first perform the req- 
uisite calculations, and then prove the Lemma. 
Condition (a) is already dealt with in ([9| and 



(10 1, and Condition (c) amounts to the require- 

N 



ment that 



> 1. 
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We turn to Condition (b) that all triple re- 
peats are all-bridged. Let Cm denote the number 
of triple repeats of length m. A union bound es- 
timate over triple repeats for the event that one 
such triple repeat fails to be all-bridged gives 



-ferror ~ ^ ^ 3 • CmC ^ ^ , (1^) 

m 

and requiring Pcrror ^ e and solving for L yields 

(15) 



L>-log — 
A e 



G, 73 
— log — 
N ^ e 



where 73 := X)m ^Cme^^/*^^'*^™'"^^^ is computed 
from the triple repeat statistics Cm- 

In order to understand the cost of all-bridging 
triple repeats, compared to simply bridging one 
copy as required by our lower bound, it is in- 
structive to study the effect of the single longest 
triple repeat. Setting Ci^^-^^^ = 1 and Cm = for 
m / ^triple makes 73 = 3e(^/^)-('^'"p"=+^) in ^ 
and 



L> L 



all 



^triple + l + ^log3e-i. (16) 



Bridging the longest triple repeat, as shown in 
requires 



Section B.2 



L > U 



G 

^triple + ^ + ^ ^ 



-1 



(17) 



Solving for in equations (17) and (16) gives 

loge^^ 



Nf > G ■ 



The ratio is 



G _ 

3 L — ^triple ~ 1 

log e"-*^ -|- log 3 



L 



'-triple 



1 



A^3 



loge~^ 



3.72 for e = 10" 



(18) 
(19) 

(20) 



This means that if the longest triple repeat is 
dominant, then for L slightly larger than ^triplei 
MultiBridging needs a coverage depth approx- 
imately 3.72 times higher than required by our 
lower bound. 

The remainder of this subsection is devoted to 
the proving Lemma |6] 



We will use mc(v) to denote the multiplicity 
(traversal count) a cycle C assigns a node v. The 
multiplicity mc(v) is also equal to the number of 
times the subsequence v appears in the sequence 
corresponding to C. For an edge e, we can simi- 
larly let mc{e) be the number of times C traverses 
the edge. The following key lemma relates node 
multiplicities with the existence of X-nodes. 

Lemma 15. Let C he a cycle in a condensed 
sequence graph G, where G itself is not a cy- 
cle, traversing every edge at least once. If v is a 
node with maximum multiplicity at least 2, i.e. 
iTT'ci^) = niax^gG mc(u) > 2, then v is an X- 
node. As a consequence, if mc{v) > 3 for some 
V, i.e. C traverses some node at least three times, 
then mc(u) > 3 for some X-node u. 

Proof. Let v be a node with maximum multiplic- 
ity mc(v) = maXugGmc(u). We will show that 
V is an X-node, i.e. (iout(v) > 2 and (iin(v) > 2. 

We prove that (iout(v) > 2 by supposing that 
dout(v) = 1 and deriving a contradiction. Denote 
the outgoing edge from v by e = (v, u), where u 
is distinct from v since otherwise G is a cycle. If 
c^in(u) > 2, then u must be traversed more times 
than V, contradicting the maximality of mc(v), 
and if din(u) = 1, then the existence of the edge 
e contradicts the fact that G is condensed. The 
argument showing that (iin(v) > 2 is symmetric 
to the case (iin(v) > 2. □ 

Proof of Lemma We assume that all triple re- 
peats are all-bridged, that there are no unbridged 
interleaved repeats, and that all reads overlap 
their successors by at least 1 base pair. We wish 
to show that MultiBridging returns the origi- 
nal sequence. 

Consider the condensed sequence graph Go 
constructed in steps 1-3 of MultiBridging. 
Suppose all X-nodes that are either all-bridged 
or correspond to bridged 2-repeats have been re- 
solved according to repeated application of the 
procedure in step 4 of MultiBridging, result- 
ing in a condensed sequence graph G. We claim 
that 1) s corresponds to a cycle C in G traversing 
every edge at least once, 2) C is Eulerian, and 3) 
C is the unique Eulerian cycle in G. 
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Proof of Claim 1. Let G„ be the graph af- 
ter n resolution steps, and suppose that C„ is a 
cycle in G„ corresponding to the sequence s and 
traversing all edges. We will show that there ex- 
ists a cycle C„+i in G^+i corresponding to s and 
traversing all edges, and that = G for a finite 
t, so by induction, there exists a cycle C in G 
corresponding to s and traversing all edges. The 
base case n = was shown in Lemma [8] Moving 
on to arbitrary n > 0, let v be an X-node in G„ 
labeled as in Fig. [6] The X-node resolution step 
is constructed precisely to preserve the existence 
of a cycle corresponding to s. Each traversal of 
V by the cycle Cn assigns an incoming edge (piv) 
to an outgoing edge (v,qj), and the resolution 
step correctly determines this pairing by the as- 
sumption on bridging reads. 

Note that all X-nodes in the graph G„+i con- 
tinue to correspond to repeats in s. The process 
terminates: let C{Gi) = EvgG, "^c,(v)lmc^(v)>i 
and observe that £(Gj) is strictly decreasing in i. 
Thus s corresponds to a cycle C in G traversing 
each edge at least once. 



Proof of Claim 2. We next show that C is 
an Eulerian cycle. If G is itself a cycle, and s 
is not formed by concatenating multiple copies 
of a shorter subsequence (assumed not to be the 
case, see discussion at end of Section [2]), then 
C traverses G exactly once and is an Eulerian 
cycle. Otherwise, if G is not a cycle, then we 



may apply Lemma 15 to see that any node with 
mc{^) ^ 3 implies the existence of an X-node u 
with mc(u) > 3. Node u must be all-bridged, 
by hypothesis, which means that an additional 
X-node resolution step can be applied to G, a 
contradiction. Thus each node v in G has mul- 
tiplicity mc(v) < 2. 

We can now argue that no edge e = (u, v) is 
traversed twice by C in the condensed sequence 
graph G, as it would have been contracted. Sup- 
pose mc{e) > 2. The node u cannot have two 
outgoing edges as this implies mc(u) > 3; sim- 
ilarly, V cannot have two incoming edges. Thus 
f^outlu) = ciin(v) = 1, but by Defn. 11 the edge 
e = (u, v) would have been contracted. 



Proof of Claim 3. It remains to show that 
there is a unique Eulerian cycle in G. All X- 
nodes in G must be unbridged 2-X-nodes (cor- 
respond to 2-repeats in s), as all other X-nodes 
were assumed to be bridged and have thus been 
resolved in G. 

We will map the sequence s to another se- 
quence s', allowing us to use the characterization 
of Lemma [9] for SBH with known multiplicities. 
Denote by G' the graph obtained by relabeling 
each node in G by a single unique symbol (no 
matter the original node label length), and set- 
ting all edge overlaps to 0. Through the relabel- 
ing, C corresponds to a cycle C in G', and let s' 
be the sequence corresponding to C . Writing ^2 
for the 2-spectrum of s', the graph G' is by con- 
struction precisely the 1-mer graph created from 
and there is a one-to-one correspondence be- 
tween X-nodes in G' and unbridged repeats in 
s'. Through the described mapping, every un- 
bridged repeat in s' maps to an unbridged repeat 
in s, with the order of repeats preserved. 

There are multiple Eulerian cycles in G only if 
there are multiple Eulerian cycles in G' since the 
graphs have the same topology, and by Lemma [9] 
the latter occurs only if there are unbridged in- 
terleaved repeats in s', which by the correspon- 
dence in the previous paragraph implies the exis- 
tence of unbridged interleaved repeats in s . □ 

C.3 Truncation estimate for bridg- 
ing repeats (Greedy and Multi- 
Bridging) 

The repeat statistics and Cm used in the algo- 
rithm performance curves are potentially overes- 
timates. This is because a large repeat family — 
one with a large number of copies / — will re- 
sult in a contribution (2) ~ /^/2 to and 

{{) « f/6 to Cm. 

We focus here on deriving an estimate for the 
required A'^, L for bridging all repeats with prob- 
ability 1 — e. This upper bound reduces the sen- 
sitivity to large families of short repeats. The 
analogous derivation for all-bridging all triple- 
repeats is very similar and is omitted. 

Suppose there repeats of length m. The 

probability that some repeat is unbridged is ap- 
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proximately, by the union bound estimate, 

P(£:) pa ^ a,„e-2M^-™) . (21) 



Requiring < e and solving for L gives 



1 7 G 7 

L > — log — = log — 

- 2A ^ e 2N ^ e 



(22) 



where 7 := X)m ^mC^^^'''^^™'. Now, if am over- 
counts the number of repeats for small values of 



m, the bound in (22) might be loose. In order 



for each read to overlap the subsequent read by 
X nucleotides, with probability of failure e/2, it 
suffices to take 



L > Lk-coy{x, I) ■= x+j log 



2N 



(23) 



1 27(2;) e 
L > mmmaxj — log , LK-cov(a;, 77)} , 

(24) 

where 7(x) = J2ni>x ^m.e'^^^^'^''"^ , and obtain a 
looser bound. 



D Critical window calculations 

D.l Window size if interleaved > triple 

We focus here on the bound due to interleaved re- 
peats (rather than triple repeats, treated subse- 
quently) , and furthermore assume that the effect 
of the single largest interleaved repeat is domi- 
nant. In this case ^interleaved = -^crit — 1 is the 

length of the shorter of the pair of interleaved re- 
peats, and let £i be the length of the longer of the 
two. For Lcrit < L < £i -|- 1, we are in the setting 
of ^ but with a redefined 71 = e2(^/G')(^crit-i)_ 
Thus, 



L>Lr 



G , -1 
+ 2iV^°^^ 



and solving for N gives 



G loge 



-1 



repeat 



2 L 



(25) 



(26) 



Nij^{L*) := A^*. This is the minimum read 
length for which coverage of the sequence suf- 
fices for reconstruction. 

We now solve for J^. First, the Lander- 
Waterman equation (|2| at = A*"* is 



G , N* 



iV* = ^ loe — , (27) 
and setting equal the right-hand sides of (pTl) 



and (26) at L = L* gives 

G ^ N* G loge-i 

— log — = 

L* ^ e 2 L* -£2-1 

A bit of algebra yields 

L* 2 



L. 



crit 



Thus, for any x < L, we may replace (22) by where 



X :- 



2 

loge~^ 



X 



(28) 



log A^* -|- log e 



(29) 



Since x < h, equation (28) implies L* < 2L 



^crit ) 



and combined with the obvious inequality L* > 
-^crit) we have Lcrit < L* < 2 Lcrit- Thus 



iVLw(2Lcrit) <N* < A^Lw(^crit) 



(30) 



and applying the Lander- Waterman fixed-point 
equation ^ yet again gives 

G , A^Lw(2Lcrit) / ,r* , G Ni^w{Lciit) 
log < N < log - 



2L 



crit 



L, 



crit 



(31) 



Writing this out gives 
loge"^ 



log -r-: + log log 



-^'^Lw(-^crit) 



+ loge 



< X 



-1 



< 



loge ^ 



log 7^ - 1 + log log ^L^^^^"'^) + log e-i 



and this can be relaxed to 
log e^^ 



log — h log e~^ -I- log log -p- 
log e~^ 



< X 



(32) 



< 



log 



Let L* be the value of L at which the curve de- Letting 



scribed by constraint (26) intersects the Lander- 



Waterman coverage value, i.e. A'j 



repeat 



(L* 



G 



r :- 



1 -I- log e 



-1 



log 



G 



log 



(33) 
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we have to a very good approximation 
L* 2(r + l) 



L. 



crit 



2(r + l) -1 



(34) 



For G 10^, Lent ~ 1000, and e = 5%, we get 
log ~ 13.8 and log e^^ ~ 3.0, so r ~ 4.6 and 



crit 



2(r + l) 
2(r + l) - 1 



1.1. 



From ( 33 ) we see that the relative size of log e ^ 
and log determines the size of the critical 
window. If in the previous example e = 10~^, 



say, then 

-^crit 

zero, r approaches zero as well and 



increases to 1.3. As e tends to 

2. 



D.2 Window size if triple > ^interleaved 

We now suppose the single longest triple re- 
peat dominates the lower bound and estimate 
the size of the critical window. In this case 
triple = Lcrit — 1 is the length of the longest triple 
repeat. Since we don't have matching lower and 
upper bounds for triple repeats, we separately 
compute the critical window size for each. 

We start with the lower bound. For L > Lent, 
the minimum value of required in order to 



bridge the longest triple repeat is given by (18) 
and repeated here: 



Atriples 



G 
"3 



loge ^ 

L — Lcrii 



(35) 



For the same example as before, G ~ 10*^, Lcrit ~ 
1000, and e = 5%, we get r « 4.6 and 



L* 



L 



crit 



3(r + l) 
3(r + l) - 1 



1.06. 



L* 



1.17, and as e 

L* , 3 



Changing e to 10 ^ makes r 
(and hence also r) tends to zero, ^ . 2 • 

The analogous computation for L* /Lent for 
the upper bound, as given by Nf in (ITsJ), yields 



L* 



r + l 



crit 



r + 



log 3 



1.12. 



(37) 



for the example with G ~ 10*, Lcrit ~ 1000, 
and e = 5%. The critical window size of the 
upper bound is about twice as large as that of 
the lower bound for typical values of G and Lcrit , 
with e moderate. But as e 
that L* /Lcrit 

L* / Lcrit 



0, we see from (37) 



00, markedly different to the 
I observed for the lower bound. 



As for the interleaved repeats case considered 
earlier, we let L* be the value of L at which 



the curve described by constraint (35) inter 



sects the Lander- Waterman coverage value, i.e. 
Atripie(L*) = Alw(L*) := N*. This is the min- 
imum read length for which coverage of the se- 
quence suffices for reconstruction. 



A similar procedure as leading to (28) gives 

L* /Lcrit 



3/(3- 



- X 



with X defined in ( 29 ) . One 



can check that the estimates on x in (32) con- 
tinue to hold, and we therefore get 



L* 

Lcrit 



3(r + l) 
3(r + l) -1 



(36) 
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